library(tidyverse)
library(knitr)
library(broom)
library(stringr)
library(modelr)
library(forcats)
library(ggmap)
library(plotly)

options(digits = 3)
set.seed(1234)
theme_set(theme_minimal())

Overview

Geospatial visualizations are some of the oldest data visualization methods in human existence. Data maps were first popularized in the seventeenth century and have grown in complexity and detail since then. Consider Google Maps, the sheer volume of data depicted, and the analytical pathways available to its users.

Geometric visualizations are used to depict spatial features, and with the incorporation of data reveal additional attributes and information. The main features of a map are defined by its scale (the proportion between distances and sizes on the map), its projection (how the three-dimensional Earth is represented on a two-dimensional surface), and its symbols (how data is depicted and visualized on the map).1

Drawing maps in R is a layer-like process. Typically you start by drawing the boundaries of the geographic regions you wish to visualize, then you add additional layers defined by other variables:

  • Points
  • Symbols
  • Fills (choropleths)

Storing geospatial data

A geographic information system (GIS) is software that is “designed to capture, store, manipulate, analyze, manage, and present spatial or geographic data”.2 There are many specialized software packages for spatial data analysis, many of which are commercial or proprietary software. For serious spatial data analysis tasks, you probably want to learn how to use these products. However for casual usage, R has a number of tools for drawing maps, most importantly ggplot2.

Built-in packages

The maps package includes the map() function for drawing maps based on bundled geodatabases using the graphics package. maps includes basic boundary files for:

  • The world
  • United States
    • Continental USA
    • States
    • Counties
  • France
  • Italy
  • New Zealand

We can extract the boundary files using map_data(), which converts the data object to a data frame:

library(maps)

# world
map_data("world") %>%
  as_tibble()
## # A tibble: 99,338 x 6
##     long   lat group order region subregion
##  * <dbl> <dbl> <dbl> <int> <chr>  <chr>    
##  1 -69.9  12.5    1.     1 Aruba  <NA>     
##  2 -69.9  12.4    1.     2 Aruba  <NA>     
##  3 -69.9  12.4    1.     3 Aruba  <NA>     
##  4 -70.0  12.5    1.     4 Aruba  <NA>     
##  5 -70.1  12.5    1.     5 Aruba  <NA>     
##  6 -70.1  12.6    1.     6 Aruba  <NA>     
##  7 -70.0  12.6    1.     7 Aruba  <NA>     
##  8 -70.0  12.6    1.     8 Aruba  <NA>     
##  9 -69.9  12.5    1.     9 Aruba  <NA>     
## 10 -69.9  12.5    1.    10 Aruba  <NA>     
## # ... with 99,328 more rows
# usa
map_data("usa") %>%
  as_tibble()
## # A tibble: 7,243 x 6
##     long   lat group order region subregion
##  * <dbl> <dbl> <dbl> <int> <chr>  <chr>    
##  1 -101.  29.7    1.     1 main   <NA>     
##  2 -101.  29.7    1.     2 main   <NA>     
##  3 -101.  29.7    1.     3 main   <NA>     
##  4 -101.  29.6    1.     4 main   <NA>     
##  5 -101.  29.6    1.     5 main   <NA>     
##  6 -101.  29.6    1.     6 main   <NA>     
##  7 -101.  29.6    1.     7 main   <NA>     
##  8 -101.  29.6    1.     8 main   <NA>     
##  9 -101.  29.6    1.     9 main   <NA>     
## 10 -101.  29.6    1.    10 main   <NA>     
## # ... with 7,233 more rows
# states
map_data("state") %>%
  as_tibble()
## # A tibble: 15,537 x 6
##     long   lat group order region  subregion
##  * <dbl> <dbl> <dbl> <int> <chr>   <chr>    
##  1 -87.5  30.4    1.     1 alabama <NA>     
##  2 -87.5  30.4    1.     2 alabama <NA>     
##  3 -87.5  30.4    1.     3 alabama <NA>     
##  4 -87.5  30.3    1.     4 alabama <NA>     
##  5 -87.6  30.3    1.     5 alabama <NA>     
##  6 -87.6  30.3    1.     6 alabama <NA>     
##  7 -87.6  30.3    1.     7 alabama <NA>     
##  8 -87.6  30.3    1.     8 alabama <NA>     
##  9 -87.7  30.3    1.     9 alabama <NA>     
## 10 -87.8  30.3    1.    10 alabama <NA>     
## # ... with 15,527 more rows

map_data() converts the geodatabases into data frames where each row is a single point on the map. The resulting data frame contains the following variables:

  • long - longitude. Things to the west of the prime meridian are negative
  • lat - latitude
  • order - this identifies the order ggplot() should follow to “connect the dots” and draw the borders
  • region and subregion identify what region or subregion a set of points surrounds
  • group - this is perhaps the most important variable in the data frame. ggplot() uses the group aesthetic to determine whether adjacent points should be connected by a line. If they are in the same group, then the points are connected. If not, then they are not connected. This is the same basic principle for standard geom_line() plots:

    library(gapminder)
    
    # no group aesthetic
    ggplot(gapminder, aes(year, lifeExp)) +
      geom_line()

    # with grouping by country
    ggplot(gapminder, aes(year, lifeExp, group = country)) +
      geom_line()

    Note that group is not the same thing as region or subregion. If a region contains landmasses that are discontiguous, there should be multiple groups to properly draw the region:

    map("state", "michigan")

Shapefiles

maps contains a very limited number of geodatabases. If you want to import a different country’s borders or some other geographic information, you will likely find the data in a shapefile. This is a special file format that encodes points, lines, and polygons in geographic space. Files appear with a .shp extension, sometimes with accompanying files ending in .dbf and .prj.

  • .shp stores the geographic coordinates of the geographic features (e.g. country, state, county)
  • .dbf stores data associated with the geographic features (e.g. unemployment rate, crime rates, percentage of votes cast for Donald Trump)
  • .prj stores information about the projection of the coordinates in the shapefile (we’ll handle this shortly)

There are a crap ton of packages for R that allow you to interact with shapefiles and spatial data. Let’s start with a shapefile for state boundaries in the United States.3 We’ll use readOGR() from the rgdal package to read in the data file:

library(rgdal)

usa <- readOGR("data/census_bureau/cb_2013_us_state_20m/cb_2013_us_state_20m.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/soltoffbc/Projects/Data Visualization/dataviz/notes/data/census_bureau/cb_2013_us_state_20m/cb_2013_us_state_20m.shp", layer: "cb_2013_us_state_20m"
## with 52 features
## It has 9 fields
## Integer64 fields read as strings:  ALAND AWATER
str(usa, max.level = 2)
## Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  52 obs. of  9 variables:
##   ..@ polygons   :List of 52
##   ..@ plotOrder  : int [1:52] 32 48 29 3 18 46 33 22 34 51 ...
##   ..@ bbox       : num [1:2, 1:2] -179.1 17.9 179.8 71.4
##   .. ..- attr(*, "dimnames")=List of 2
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot

This is decidedly not a tidy data frame. Once you import the shapefile, it’s best to convert it to a data frame for ggplot(). We can do this using fortify():

usa %>%
  fortify() %>%
  head()
##    long  lat order  hole piece id group
## 1 -88.5 31.9     1 FALSE     1  0   0.1
## 2 -88.5 31.9     2 FALSE     1  0   0.1
## 3 -88.5 31.9     3 FALSE     1  0   0.1
## 4 -88.5 31.9     4 FALSE     1  0   0.1
## 5 -88.5 32.0     5 FALSE     1  0   0.1
## 6 -88.5 32.0     6 FALSE     1  0   0.1

Under this approach, the id variable is just a number assigned to each region (in this case, each state/territory). However the shapefile contains linked data with attributes for each region. We can access this using the @data accessor:

usa@data %>%
  as_tibble
## # A tibble: 52 x 9
##    STATEFP STATENS  AFFGEOID    GEOID STUSPS NAME    LSAD  ALAND   AWATER 
##  * <fct>   <fct>    <fct>       <fct> <fct>  <fct>   <fct> <fct>   <fct>  
##  1 01      01779775 0400000US01 01    AL     Alabama 00    131172… 459492…
##  2 05      00068085 0400000US05 05    AR     Arkans… 00    134772… 295881…
##  3 06      01779778 0400000US06 06    CA     Califo… 00    403482… 204843…
##  4 09      01779780 0400000US09 09    CT     Connec… 00    125419… 181540…
##  5 12      00294478 0400000US12 12    FL     Florida 00    138897… 314136…
##  6 13      01705317 0400000US13 13    GA     Georgia 00    148962… 494780…
##  7 16      01779783 0400000US16 16    ID     Idaho   00    214045… 239773…
##  8 17      01779784 0400000US17 17    IL     Illino… 00    143793… 620168…
##  9 18      00448508 0400000US18 18    IN     Indiana 00    927895… 153667…
## 10 20      00481813 0400000US20 20    KS     Kansas  00    211752… 134671…
## # ... with 42 more rows

We can keep these variables in the new data frame through parameters to fortify(region = ""):

# state name
usa %>%
  fortify(region = "NAME") %>%
  head
##    long  lat order  hole piece      id     group
## 1 -88.5 31.9     1 FALSE     1 Alabama Alabama.1
## 2 -88.5 31.9     2 FALSE     1 Alabama Alabama.1
## 3 -88.5 31.9     3 FALSE     1 Alabama Alabama.1
## 4 -88.5 31.9     4 FALSE     1 Alabama Alabama.1
## 5 -88.5 32.0     5 FALSE     1 Alabama Alabama.1
## 6 -88.5 32.0     6 FALSE     1 Alabama Alabama.1
# FIPS code
usa %>%
  fortify(region = "STATEFP") %>%
  head
##    long  lat order  hole piece id group
## 1 -88.5 31.9     1 FALSE     1 01  01.1
## 2 -88.5 31.9     2 FALSE     1 01  01.1
## 3 -88.5 31.9     3 FALSE     1 01  01.1
## 4 -88.5 31.9     4 FALSE     1 01  01.1
## 5 -88.5 32.0     5 FALSE     1 01  01.1
## 6 -88.5 32.0     6 FALSE     1 01  01.1
# keep it all
(usa2 <- usa %>%
  fortify(region = "NAME") %>%
  as_tibble %>%
  left_join(usa@data, by = c("id" = "NAME")))
## # A tibble: 46,174 x 15
##     long   lat order hole  piece id     group   STATEFP STATENS  AFFGEOID 
##    <dbl> <dbl> <int> <lgl> <fct> <chr>  <fct>   <fct>   <fct>    <fct>    
##  1 -88.5  31.9     1 FALSE 1     Alaba… Alabam… 01      01779775 0400000U…
##  2 -88.5  31.9     2 FALSE 1     Alaba… Alabam… 01      01779775 0400000U…
##  3 -88.5  31.9     3 FALSE 1     Alaba… Alabam… 01      01779775 0400000U…
##  4 -88.5  31.9     4 FALSE 1     Alaba… Alabam… 01      01779775 0400000U…
##  5 -88.5  32.0     5 FALSE 1     Alaba… Alabam… 01      01779775 0400000U…
##  6 -88.5  32.0     6 FALSE 1     Alaba… Alabam… 01      01779775 0400000U…
##  7 -88.5  32.1     7 FALSE 1     Alaba… Alabam… 01      01779775 0400000U…
##  8 -88.4  32.2     8 FALSE 1     Alaba… Alabam… 01      01779775 0400000U…
##  9 -88.4  32.2     9 FALSE 1     Alaba… Alabam… 01      01779775 0400000U…
## 10 -88.4  32.2    10 FALSE 1     Alaba… Alabam… 01      01779775 0400000U…
## # ... with 46,164 more rows, and 5 more variables: GEOID <fct>,
## #   STUSPS <fct>, LSAD <fct>, ALAND <fct>, AWATER <fct>

Simple features

Simple features or simple feature access refers to a formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on the spatial geometry of these objects. It also describes how such objects can be stored in and retrieved from databases, and which geometrical operations should be defined for them.

The standard is widely implemented in spatial databases (such as PostGIS), commercial GIS (e.g., ESRI ArcGIS) and forms the vector data basis for libraries such as GDAL. A subset of simple features forms the GeoJSON standard.

R has well-supported classes for storing spatial data (sp) and interfacing to the above mentioned environments (rgdal, rgeos), but has so far lacked a complete implementation of simple features, making conversions at times convoluted, inefficient or incomplete. The package sf tries to fill this gap, and aims at succeeding sp in the long term.

What is a feature?

  • A thing or an object in the real world
  • Frequently consist of other objects
    • A set of features can form a single feature
    • i.e. a tree can be a feature, but a set of trees can form a forest which is itself a stand-alone feature
  • Features have a geometry describing where on Earth the feature is located
  • They have attributes, which describe other properties of the feature

Dimensions

All geometries are composed of points. Points are coordinates in a 2-, 3- or 4-dimensional space. All points in a geometry have the same dimensionality. In addition to X and Y coordinates, there are two optional additional dimensions:

  • a Z coordinate, denoting altitude
  • an M coordinate (rarely used), denoting some measure that is associated with the point, rather than with the feature as a whole (in which case it would be a feature attribute); examples could be time of measurement, or measurement error of the coordinates

The four possible cases then are:

  1. two-dimensional points refer to x and y, easting and northing, or longitude and latitude, we refer to them as XY
  2. three-dimensional points as XYZ
  3. three-dimensional points as XYM
  4. four-dimensional points as XYZM (the third axis is Z, fourth M)

Simple feature geometry types

The following seven simple feature types are the most common, and are for instance the only ones used for GeoJSON:

type description
POINT zero-dimensional geometry containing a single point
LINESTRING sequence of points connected by straight, non-self intersecting line pieces; one-dimensional geometry
POLYGON geometry with a positive area (two-dimensional); sequence of points form a closed, non-self intersecting ring; the first ring denotes the exterior ring, zero or more subsequent rings denote holes in this exterior ring
MULTIPOINT set of points; a MULTIPOINT is simple if no two Points in the MULTIPOINT are equal
MULTILINESTRING set of linestrings
MULTIPOLYGON set of polygons
GEOMETRYCOLLECTION set of geometries of any type except GEOMETRYCOLLECTION

Coordinate reference system

Coordinates can only be placed on the Earth’s surface when their coordinate reference system (CRS) is known; this may be an spheroid CRS such as WGS84, a projected, two-dimensional (Cartesian) CRS such as a UTM zone or Web Mercator, or a CRS in three-dimensions, or including time. Similarly, M-coordinates need an attribute reference system, e.g. a measurement unit.

Simple features in R

  • sf stores simple features as basic R data structures (lists, matrix, vectors, etc.)
  • Typical data structure stores geometric and feature attributes as a data frame
  • One row per feature
  • Since feature geometries are not single-valued, they are put in a list-column with each list element holding teh simple feature geometry of that feature
  • The three classes used to represent simple features are:
    • sf, the table (data.frame) with feature attributes and feature geometries, which contains
    • sfc, the list-column with the geometries for each feature (record), which is composed of
    • sfg, the feature geometry of an individual simple feature

Example: North Carolina

st_read(0 imports a shapefile and converts it to a simple feature data frame):

library(sf)
nc <- st_read(system.file("shape/nc.shp", package="sf"))
## Reading layer `nc' from data source `/Library/Frameworks/R.framework/Versions/3.4/Resources/library/sf/shape/nc.shp' using driver `ESRI Shapefile'
## Simple feature collection with 100 features and 14 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.3 ymin: 33.9 xmax: -75.5 ymax: 36.6
## epsg (SRID):    4267
## proj4string:    +proj=longlat +datum=NAD27 +no_defs

The short report printed gives the file name, the driver (ESRI Shapefile), mentions that there are 100 features (records, represented as rows) and 14 fields (attributes, represented as columns). This object is of class

class(nc)
## [1] "sf"         "data.frame"

meaning it extends (and “is” a) data.frame, but with a single list-column with geometries, which is held in the column with name

attr(nc, "sf_column")
## [1] "geometry"

If we print the first three features, we see their attribute values and an abridged version of the geometry

print(nc[9:15], n = 3)
## Simple feature collection with 100 features and 6 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.3 ymin: 33.9 xmax: -75.5 ymax: 36.6
## epsg (SRID):    4267
## proj4string:    +proj=longlat +datum=NAD27 +no_defs
## First 3 features:
##   BIR74 SID74 NWBIR74 BIR79 SID79 NWBIR79                       geometry
## 1  1091     1      10  1364     0      19 MULTIPOLYGON (((-81.5 36.2,...
## 2   487     0      10   542     3      12 MULTIPOLYGON (((-81.2 36.4,...
## 3  3188     5     208  3616     6     260 MULTIPOLYGON (((-80.5 36.2,...

In the output we see:

  • Each row is a simple feature: a single record, or data.frame row, consisting of attributes and geometry
  • The geometry column is a simple feature list-column (an object of class sfc, which is a column in the data.frame)
  • Each value in geometry is a single simple feature geometry (an object of class sfg)

Plotting methods for boundaries

maps

The maps package includes the map() function for drawing maps based on bundled geodatabases using the graphics package:

# map of the world
map()

# usa boundaries
map("usa")

map("state")

# county map of illinois
map("county", "illinois")

These are fine, but we’d rather use them with our friendly ggplot2 library. We can do this by converting the geodatabases into data frames for plotting with ggplot2 using the map_data() function we saw before.

ggplot2

map_data()

Let’s draw a map of the United States. First we need to store the USA map boundaries in a data frame:

usa <- map_data("usa") %>%
  as_tibble
usa
## # A tibble: 7,243 x 6
##     long   lat group order region subregion
##  * <dbl> <dbl> <dbl> <int> <chr>  <chr>    
##  1 -101.  29.7    1.     1 main   <NA>     
##  2 -101.  29.7    1.     2 main   <NA>     
##  3 -101.  29.7    1.     3 main   <NA>     
##  4 -101.  29.6    1.     4 main   <NA>     
##  5 -101.  29.6    1.     5 main   <NA>     
##  6 -101.  29.6    1.     6 main   <NA>     
##  7 -101.  29.6    1.     7 main   <NA>     
##  8 -101.  29.6    1.     8 main   <NA>     
##  9 -101.  29.6    1.     9 main   <NA>     
## 10 -101.  29.6    1.    10 main   <NA>     
## # ... with 7,233 more rows

Simple black map

  • We can use geom_polygon() to draw lines between points and close them up (connect the last point with the first point)
  • x = long and y = lat
  • We map the group aesthetic to the group column
ggplot() +
  geom_polygon(data = usa, aes(x = long, y = lat, group = group))

Ta-da! A few things we want to immediately start thinking about. First, because latitude and longitude have absolute relations to one another, we need to fix the aspect ratio so that we don’t accidentially compress the graph in one dimension. Fixing the aspect ratio also means that even if you change the outer dimensions of the graph (i.e. adjust the window size), then the aspect ratio of the graph itself remains unchanged. We can do this using the coord_fixed() function:

ggplot() +
  geom_polygon(data = usa, aes(x = long, y = lat, group = group)) +
  coord_fixed()

Now it looks a little squished. You can play around with the aspect ratio to find a better projection:4

ggplot() +
  geom_polygon(data = usa, aes(x = long, y = lat, group = group)) +
  coord_fixed(1.3)

Change the colors

Since this is a ggplot() object, we can change the fill and color aesthetics for the map:

ggplot() +
  geom_polygon(data = usa, aes(x = long, y = lat, group = group),
               fill = NA, color = "red") + 
  coord_fixed(1.3)

gg1 <- ggplot() + 
  geom_polygon(data = usa, aes(x = long, y = lat, group = group),
               fill = "violet", color = "blue") + 
  coord_fixed(1.3)
gg1

Always remember to use the group aesthetic

What happens if we plot the map without the group aesthetic?

ggplot() + 
  geom_polygon(data = usa, aes(x = long, y = lat),
               fill = "violet", color = "blue") + 
  coord_fixed(1.3)

Oops. The map doesn’t connect the dots in the correct order.

State maps

maps also comes with a state boundaries geodatabase:

states <- map_data("state") %>%
  as_tibble()
states
## # A tibble: 15,537 x 6
##     long   lat group order region  subregion
##  * <dbl> <dbl> <dbl> <int> <chr>   <chr>    
##  1 -87.5  30.4    1.     1 alabama <NA>     
##  2 -87.5  30.4    1.     2 alabama <NA>     
##  3 -87.5  30.4    1.     3 alabama <NA>     
##  4 -87.5  30.3    1.     4 alabama <NA>     
##  5 -87.6  30.3    1.     5 alabama <NA>     
##  6 -87.6  30.3    1.     6 alabama <NA>     
##  7 -87.6  30.3    1.     7 alabama <NA>     
##  8 -87.6  30.3    1.     8 alabama <NA>     
##  9 -87.7  30.3    1.     9 alabama <NA>     
## 10 -87.8  30.3    1.    10 alabama <NA>     
## # ... with 15,527 more rows

By default, each state is filled with the same color:

ggplot(data = states) + 
  geom_polygon(aes(x = long, y = lat, group = group), color = "white") + 
  coord_fixed(1.3)

We can adjust this by using the fill aesthetic. Here, let’s map region to fill:

ggplot(data = states) + 
  geom_polygon(aes(x = long, y = lat, fill = region, group = group), color = "white") + 
  coord_fixed(1.3) +
  # turn off color legend
  theme(legend.position = "none")

Here, each state is assigned a different color at random. You can start to imagine how we might build a choropleth by mapping a different variable to fill, but we’ll return to that in a little bit.

Plot a subset of states

We can use filter() to subset the states data frame and draw a map with only a subset of the states. For example, if we want to only graph states in the Midwest:

midwest <- subset(states, region %in% c("illinois", "indiana", "iowa",
                                        "kansas", "michigan", "minnesota",
                                        "missouri", "nebraska", "north dakota",
                                        "ohio", "south dakota", "wisconsin"))

ggplot(data = midwest) + 
  geom_polygon(aes(x = long, y = lat, group = group),
               fill = "palegreen", color = "black") + 
  coord_fixed(1.3)

Zoom in on Illinois and its counties

First let’s get the Illinois boundaries:

il_df <- filter(states, region == "illinois")

Now let’s get the accompanying counties:

counties <- map_data("county") %>%
  as_tibble
il_county <- filter(counties, region == "illinois")
il_county
## # A tibble: 1,696 x 6
##     long   lat group order region   subregion
##    <dbl> <dbl> <dbl> <int> <chr>    <chr>    
##  1 -91.5  40.2  561. 21883 illinois adams    
##  2 -90.9  40.2  561. 21884 illinois adams    
##  3 -90.9  40.1  561. 21885 illinois adams    
##  4 -90.9  39.8  561. 21886 illinois adams    
##  5 -90.9  39.8  561. 21887 illinois adams    
##  6 -91.4  39.8  561. 21888 illinois adams    
##  7 -91.4  39.8  561. 21889 illinois adams    
##  8 -91.4  39.9  561. 21890 illinois adams    
##  9 -91.5  39.9  561. 21891 illinois adams    
## 10 -91.4  39.9  561. 21892 illinois adams    
## # ... with 1,686 more rows

Plot the state first. This time, lets’ remove all the axes gridlines and background junk using theme_void():

il_base <- ggplot(data = il_df, mapping = aes(x = long, y = lat, group = group)) + 
  coord_fixed(1.3) + 
  geom_polygon(color = "black", fill = "gray")

il_base +
  theme_void()

Now let’s plot the county boundaries in white:

il_base +
  theme_void() + 
  geom_polygon(data = il_county, fill = NA, color = "white") +
  geom_polygon(color = "black", fill = NA)  # get the state border back on top

But what about Alaska and Hawaii?

If you were observant, you noticed map_data("states") only includes the 48 contiguous states in the United States. This is because Alaska and Hawaii exist far off from the rest of the states. What happens if you try to draw a map with them in it?

Yup, that doesn’t look right. Most maps of the United States place Alaska and Hawaii as insets to the south of California. Until recently, in R this was an extremely tedious task that required manually changing the latitude and longitude coordinates for these states to place them in the correct location. Fortunately a new package is available that has already done the work for you. fiftystater includes the fifty_states data frame which contains adjusted coordinates for Alaska and Hawaii to plot them with the mainland:

library(fiftystater)

data("fifty_states")
fifty_states %>%
  as_tibble
## # A tibble: 13,694 x 7
##     long   lat order hole  piece id      group    
##    <dbl> <dbl> <int> <lgl> <fct> <chr>   <fct>    
##  1 -85.1  32.0     1 FALSE 1     alabama Alabama.1
##  2 -85.1  31.9     2 FALSE 1     alabama Alabama.1
##  3 -85.1  31.9     3 FALSE 1     alabama Alabama.1
##  4 -85.1  31.8     4 FALSE 1     alabama Alabama.1
##  5 -85.1  31.8     5 FALSE 1     alabama Alabama.1
##  6 -85.1  31.7     6 FALSE 1     alabama Alabama.1
##  7 -85.1  31.7     7 FALSE 1     alabama Alabama.1
##  8 -85.1  31.7     8 FALSE 1     alabama Alabama.1
##  9 -85.1  31.6     9 FALSE 1     alabama Alabama.1
## 10 -85.0  31.6    10 FALSE 1     alabama Alabama.1
## # ... with 13,684 more rows
ggplot(data = fifty_states, mapping = aes(x = long, y = lat, group = group)) + 
  coord_fixed(1.3) + 
  geom_polygon(color = "black", fill = "gray")

From a shapefile

Recall how to import a shapefile and convert it to a data frame. Once we have it in that form:

glimpse(usa2)
## Observations: 46,174
## Variables: 15
## $ long     <dbl> -88.5, -88.5, -88.5, -88.5, -88.5, -88.5, -88.5, -88....
## $ lat      <dbl> 31.9, 31.9, 31.9, 31.9, 32.0, 32.0, 32.1, 32.2, 32.2,...
## $ order    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ hole     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS...
## $ piece    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ id       <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama"...
## $ group    <fct> Alabama.1, Alabama.1, Alabama.1, Alabama.1, Alabama.1...
## $ STATEFP  <fct> 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, 0...
## $ STATENS  <fct> 01779775, 01779775, 01779775, 01779775, 01779775, 017...
## $ AFFGEOID <fct> 0400000US01, 0400000US01, 0400000US01, 0400000US01, 0...
## $ GEOID    <fct> 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, 01, 0...
## $ STUSPS   <fct> AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, A...
## $ LSAD     <fct> 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 0...
## $ ALAND    <fct> 131172434095, 131172434095, 131172434095, 13117243409...
## $ AWATER   <fct> 4594920201, 4594920201, 4594920201, 4594920201, 45949...

Now we can plot it like normal using ggplot():

ggplot(data = usa2, mapping = aes(x = long, y = lat, group = group)) + 
  coord_fixed(1.3) + 
  geom_polygon(color = "black", fill = "gray")

Because the data file comes from the Census Bureau, we also get boundaries for Alaska, Hawaii, and Puerto Rico. To remove them from the data, just use filter():

usa2 <- usa2 %>%
  filter(id != "Alaska", id != "Hawaii", id != "Puerto Rico")

ggplot(data = usa2, mapping = aes(x = long, y = lat, group = group)) + 
  coord_fixed(1.3) + 
  geom_polygon(color = "black", fill = "gray")

An sf object

Recall the North Carolina simple features dataset:

glimpse(nc)
## Observations: 100
## Variables: 15
## $ AREA      <dbl> 0.114, 0.061, 0.143, 0.070, 0.153, 0.097, 0.062, 0.0...
## $ PERIMETER <dbl> 1.44, 1.23, 1.63, 2.97, 2.21, 1.67, 1.55, 1.28, 1.42...
## $ CNTY_     <dbl> 1825, 1827, 1828, 1831, 1832, 1833, 1834, 1835, 1836...
## $ CNTY_ID   <dbl> 1825, 1827, 1828, 1831, 1832, 1833, 1834, 1835, 1836...
## $ NAME      <fct> Ashe, Alleghany, Surry, Currituck, Northampton, Hert...
## $ FIPS      <fct> 37009, 37005, 37171, 37053, 37131, 37091, 37029, 370...
## $ FIPSNO    <dbl> 37009, 37005, 37171, 37053, 37131, 37091, 37029, 370...
## $ CRESS_ID  <int> 5, 3, 86, 27, 66, 46, 15, 37, 93, 85, 17, 79, 39, 73...
## $ BIR74     <dbl> 1091, 487, 3188, 508, 1421, 1452, 286, 420, 968, 161...
## $ SID74     <dbl> 1, 0, 5, 1, 9, 7, 0, 0, 4, 1, 2, 16, 4, 4, 4, 18, 3,...
## $ NWBIR74   <dbl> 10, 10, 208, 123, 1066, 954, 115, 254, 748, 160, 550...
## $ BIR79     <dbl> 1364, 542, 3616, 830, 1606, 1838, 350, 594, 1190, 20...
## $ SID79     <dbl> 0, 3, 6, 2, 3, 5, 2, 2, 2, 5, 2, 5, 4, 4, 6, 17, 4, ...
## $ NWBIR79   <dbl> 19, 12, 260, 145, 1197, 1237, 139, 371, 844, 176, 59...
## $ geometry  <MULTIPOLYGON [°]> MULTIPOLYGON (((-81.5 36.2,..., MULTIPO...

We have information on each county in the state of North Carolina. To plot using ggplot2, make sure you have the development version of ggplot2 installed using devtools::install_github("tidyverse/ggplot2"). We can use geom_sf() to plot the shapefile:

ggplot() +
  geom_sf(data = nc)

This produces just the boundaries.

ggmap

Rather than relying on geodatabases or shapefiles which store boundaries as numeric data, we can use ggmap to retrieve raster map tiles from online mapping services.

library(ggmap)
get_googlemap("Saieh Hall For Economics, East 58th Street, Chicago, IL", zoom = 10) %>%
  ggmap()

This generates a ggplot() which uses the Google Maps image as a raster image, or background you can draw on top of.

leaflet

Leaflet is an open-source JavaScript library for drawing interactive maps. leaflet is a package for R that allows you to generate Leaflet maps using R functions and data objects. These maps are highly customizable and accept a wide range of data sources, including:

  • Data frames
  • Shapefiles
  • GeoJSON

leaflet integrates nicely into the tidyverse in that the map objects (or widgets) are built as a series of piped operations. The basic process is:

  1. Create a map widget using leaflet()
  2. Add layers (or features) to the map by using layer functions (e.g. addTiles, addMarkers, addPolygons)
  3. Build multiple layers as necessary
  4. Print the map widget to display it
library(leaflet)

m <- leaflet() %>%
  setView(lng = -87.596154, lat = 41.790115, zoom = 14)
m %>%
  addTiles()

Changing map projections

As you saw in The Truthful Art, representing portions of the globe on a flat surface can be challenging. Depending on how you project the map, you can distort or emphasize certain features of the map. Fortunately, ggplot() includes the coord_map() function which allows us to easily implement different projection methods.5 Depending on the projection method, you may need to pass additional arguments to coord_map() to define the standard parallel lines used in projecting the map:

ggplot(data = usa2, mapping = aes(x = long, y = lat, group = group)) + 
  geom_polygon(color = "black", fill = "gray") +
  coord_map() +
  ggtitle("Mercator projection (default)")

ggplot(data = usa2, mapping = aes(x = long, y = lat, group = group)) + 
  geom_polygon(color = "black", fill = "gray") +
  coord_map(projection = "albers", lat0 = 25, lat1 = 50) +
  ggtitle("Albers equal-area projection")

ggplot(data = map_data("world"), mapping = aes(x = long, y = lat, group = group)) + 
  geom_polygon(color = "black", fill = "gray") +
  coord_map(projection = "mollweide", xlim = c(-180, 180)) +
  ggtitle("Mollweide projection")

Adding data to the map

Points

Region boundaries serve as the background in geospatial data visualization - so now we need to add data. Some types of channels (points and symbols) are overlaid on top of the boundaries, whereas other channels (fill) are incorporated into the region layer itself. Let’s look at the first set of channels.

Points

Let’s use our usa2 map data to add some points. The airports data frame in the nycflights13 package includes geographic info on airports in the United States.

library(nycflights13)
airports
## # A tibble: 1,458 x 8
##    faa   name                   lat    lon   alt    tz dst   tzone        
##    <chr> <chr>                <dbl>  <dbl> <int> <dbl> <chr> <chr>        
##  1 04G   Lansdowne Airport     41.1  -80.6  1044   -5. A     America/New_…
##  2 06A   Moton Field Municip…  32.5  -85.7   264   -6. A     America/Chic…
##  3 06C   Schaumburg Regional   42.0  -88.1   801   -6. A     America/Chic…
##  4 06N   Randall Airport       41.4  -74.4   523   -5. A     America/New_…
##  5 09J   Jekyll Island Airpo…  31.1  -81.4    11   -5. A     America/New_…
##  6 0A9   Elizabethton Munici…  36.4  -82.2  1593   -5. A     America/New_…
##  7 0G6   Williams County Air…  41.5  -84.5   730   -5. A     America/New_…
##  8 0G7   Finger Lakes Region…  42.9  -76.8   492   -5. A     America/New_…
##  9 0P2   Shoestring Aviation…  39.8  -76.6  1000   -5. U     America/New_…
## 10 0S9   Jefferson County In…  48.1 -123.    108   -8. A     America/Los_…
## # ... with 1,448 more rows

Each airport has it’s geographic location encoded through lat and lon. To draw these points on the map, basically we draw a scatterplot with x = lon and y = lat. In fact we could simply do that:

ggplot(airports, aes(lon, lat)) +
  geom_point()

Let’s overlay it with the mapped state borders:

ggplot() + 
  coord_map() + 
  geom_polygon(data = usa2, mapping = aes(x = long, y = lat, group = group),
               color = "black", fill = "gray") +
  geom_point(data = airports, aes(x = lon, y = lat), shape = 1)

Slight problem. We have airports listed outside of the continental United States. There are a couple ways to rectify this. Unfortunately airports does not include a variable identifying state so the filter() operation is not that simple. The easiest solution is to crop the limits of the graph to only show the mainland:

ggplot() + 
  coord_map(xlim = c(-130, -60),
            ylim = c(20, 50)) + 
  geom_polygon(data = usa2, mapping = aes(x = long, y = lat, group = group),
               color = "black", fill = "gray") +
  geom_point(data = airports, aes(x = lon, y = lat), shape = 1)

If we want to change the projection method, the points will automatically adjust too:

ggplot() + 
  coord_map(projection = "albers", lat0 = 25, lat1 = 50,
            xlim = c(-130, -60),
            ylim = c(20, 50)) + 
  geom_polygon(data = usa2, mapping = aes(x = long, y = lat, group = group),
               color = "black", fill = "gray") +
  geom_point(data = airports, aes(x = lon, y = lat), shape = 1)

Symbols

We can change the size or type of symbols on the map. For instance, we can draw a bubble plot (also known as a proportional symbol map) and encode the altitude of the airport through the size channel:

ggplot() + 
  coord_map(xlim = c(-130, -60),
            ylim = c(20, 50)) + 
  geom_polygon(data = usa2, mapping = aes(x = long, y = lat, group = group),
               color = "black", fill = "white") +
  geom_point(data = airports, aes(x = lon, y = lat, size = alt),
             fill = "grey", color = "black", alpha = .2) +
  theme_void() +
  theme(legend.position = "none")

Circle area is proportional to the airport’s altitude (in feet). Or we could scale it based on the number of arriving flights in flights:

airports_n <- flights %>%
  count(dest) %>%
  left_join(airports, by = c("dest" = "faa"))

ggplot() + 
  coord_map(xlim = c(-130, -60),
            ylim = c(20, 50)) + 
  geom_polygon(data = usa2, mapping = aes(x = long, y = lat, group = group),
               color = "black", fill = "white") +
  geom_point(data = airports_n, aes(x = lon, y = lat, size = n),
             fill = "grey", color = "black", alpha = .2) +
  theme_void() +
  theme(legend.position = "none")

airports contains a list of virtually all commercial airports in the United States. However flights only contains data on flights departing from New York City airports (JFK, LaGuardia, or Newark) and only services a few airports around the country.

Fill (choropleths)

Choropleth maps encode information by assigning shades of colors to defined areas on a map (e.g. countries, states, counties, zip codes). There are lots of ways to tweak and customize these graphs, which is generally a good idea because remember that color is one of the harder-to-decode channels.

With shapefiles

We’ll continue to use the usa2 shapefile. Let’s reload it and also load and tidy the county shapefile:

usa <- readOGR("data/census_bureau/cb_2013_us_state_20m/cb_2013_us_state_20m.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/soltoffbc/Projects/Data Visualization/dataviz/notes/data/census_bureau/cb_2013_us_state_20m/cb_2013_us_state_20m.shp", layer: "cb_2013_us_state_20m"
## with 52 features
## It has 9 fields
## Integer64 fields read as strings:  ALAND AWATER
usa2 <- usa %>%
  fortify(region = "GEOID") %>%
  as_tibble %>%
  left_join(usa@data, by = c("id" = "GEOID")) %>%
  # filter out Alaska, Hawaii, Puerto Rico via FIPS codes
  filter(!(STATEFP %in% c("02", "15", "72")))

counties <- readOGR("data/census_bureau/cb_2013_us_county_20m/cb_2013_us_county_20m.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/soltoffbc/Projects/Data Visualization/dataviz/notes/data/census_bureau/cb_2013_us_county_20m/cb_2013_us_county_20m.shp", layer: "cb_2013_us_county_20m"
## with 3221 features
## It has 9 fields
## Integer64 fields read as strings:  ALAND AWATER
counties2 <- counties %>%
  fortify(region = "GEOID") %>%
  as_tibble %>%
  left_join(counties@data, by = c("id" = "GEOID")) %>%
  # filter out Alaska, Hawaii, Puerto Rico via FIPS codes
  filter(!(STATEFP %in% c("02", "15", "72")))

ggplot(counties2, mapping = aes(x = long, y = lat, group = group)) + 
  geom_polygon(color = "black", fill = "gray") +
  coord_map()

We’ll draw choropleths for the number of foreign-born individuals in each region (state or county). We can get those files from the census_bureau folder. Let’s also normalize our measure by the total population to get the rate of foreign-born individuals in the population:

(fb_state <- read_csv("data/census_bureau/ACS_13_5YR_B05012_state/ACS_13_5YR_B05012.csv") %>%
  mutate(rate = HD01_VD03 / HD01_VD01))
## # A tibble: 51 x 10
##    GEO.id GEO.id2 `GEO.display-la… HD01_VD01 HD02_VD01 HD01_VD02 HD02_VD02
##    <chr>  <chr>   <chr>                <int> <chr>         <int>     <int>
##  1 04000… 01      Alabama            4799277 <NA>        4631045      2881
##  2 04000… 02      Alaska              720316 <NA>         669941      1262
##  3 04000… 04      Arizona            6479703 <NA>        5609835      7725
##  4 04000… 05      Arkansas           2933369 <NA>        2799972      2568
##  5 04000… 06      California        37659181 <NA>       27483342     30666
##  6 04000… 08      Colorado           5119329 <NA>        4623809      5778
##  7 04000… 09      Connecticut        3583561 <NA>        3096374      5553
##  8 04000… 10      Delaware            908446 <NA>         831683      2039
##  9 04000… 11      District of Col…    619371 <NA>         534142      2017
## 10 04000… 12      Florida           19091156 <NA>       15392410     16848
## # ... with 41 more rows, and 3 more variables: HD01_VD03 <int>,
## #   HD02_VD03 <int>, rate <dbl>
(fb_county <- read_csv("data/census_bureau/ACS_13_5YR_B05012_county/ACS_13_5YR_B05012.csv") %>%
  mutate(rate = HD01_VD03 / HD01_VD01))
## # A tibble: 3,143 x 10
##    GEO.id GEO.id2 `GEO.display-la… HD01_VD01 HD02_VD01 HD01_VD02 HD02_VD02
##    <chr>  <chr>   <chr>                <int>     <int>     <int>     <int>
##  1 05000… 01001   Autauga County,…     54907        NA     54019       277
##  2 05000… 01003   Baldwin County,…    187114        NA    180308       802
##  3 05000… 01005   Barbour County,…     27321        NA     26517       145
##  4 05000… 01007   Bibb County, Al…     22754        NA     22486        86
##  5 05000… 01009   Blount County, …     57623        NA     55122       262
##  6 05000… 01011   Bullock County,…     10746        NA     10170       484
##  7 05000… 01013   Butler County, …     20624        NA     20467        91
##  8 05000… 01015   Calhoun County,…    117714        NA    114892       392
##  9 05000… 01017   Chambers County…     34145        NA     33786       101
## 10 05000… 01019   Cherokee County…     26034        NA     25849       106
## # ... with 3,133 more rows, and 3 more variables: HD01_VD03 <int>,
## #   HD02_VD03 <int>, rate <dbl>

Joining the data to regions

Now that we have our data, we want to draw it on the map. To do that, we have to join together our data sources - the shapefiles and the CSVs. Normally joining data files requires a _join() operation of some sort. However when using ggplot2, we don’t have to do this. Remember that we can pass different data frames into different layers of a ggplot() object. Rather than using geom_polygon() to draw our maps, now we switch to geom_map():

ggplot(fb_state, aes(map_id = GEO.id2)) +
  geom_map(aes(fill = rate), map = usa2) +
  expand_limits(x = usa2$long, y = usa2$lat)

Let’s break down what just happened:

  • fb_state is the data frame with the variables we want to visualize
  • map_id = GEO.id2 identifies the column in fb_state that uniquely matches each observation to a region in usa2
  • geom_map(aes(fill = rate), map = usa2)
    • fill = rate identifies the column in fb_state we will use to determine the color of each region
    • map = usa2 is the data frame containing the boundary coordinates
  • expand_limits(x = usa2$long, y = usa2$lat) ensures the graph is drawn to the proper window. Because the default data frame for this ggplot() object is fb_state, it won’t contain the necessary information to size the window

We can then tweak this up by adding a title, removing the background (but retaining the legend), and projecting the map using a different method:

ggplot(fb_state, aes(map_id = GEO.id2)) +
  geom_map(aes(fill = rate), map = usa2) +
  expand_limits(x = usa2$long, y = usa2$lat) +
  scale_fill_continuous(labels = scales::percent) +
  labs(title = "Rate of foreign-born individuals in the population",
       fill = NULL) +
  ggthemes::theme_map() +
  coord_map(projection = "albers", lat0 = 25, lat1 = 50)

We could do the same thing for the county-level data:

ggplot(fb_county, aes(map_id = GEO.id2)) +
  geom_map(aes(fill = rate), map = counties2) +
  expand_limits(x = counties2$long, y = counties2$lat) +
  scale_fill_continuous(labels = scales::percent) +
  labs(title = "Rate of foreign-born individuals in the population",
       fill = NULL) +
  ggthemes::theme_map() +
  coord_map(projection = "albers", lat0 = 25, lat1 = 50)

sf

If we want to plot feature attributes, the attributes are already joined with the geometric features. Just use the fill aesthetic:

ggplot() +
  geom_sf(data = nc, aes(fill = BIR74))

Because nc is a data frame, we can also apply basic reshaping functions to it. For instance, what if we want a facet plot of the SID rate in 1974 and 1979?

(nc2 <- nc %>%
  select(SID74, SID79, geometry) %>%
  gather(VAR, SID, -geometry))
## Simple feature collection with 200 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.3 ymin: 33.9 xmax: -75.5 ymax: 36.6
## epsg (SRID):    4267
## proj4string:    +proj=longlat +datum=NAD27 +no_defs
## First 10 features:
##      VAR SID                       geometry
## 1  SID74   1 MULTIPOLYGON (((-81.5 36.2,...
## 2  SID74   0 MULTIPOLYGON (((-81.2 36.4,...
## 3  SID74   5 MULTIPOLYGON (((-80.5 36.2,...
## 4  SID74   1 MULTIPOLYGON (((-76 36.3, -...
## 5  SID74   9 MULTIPOLYGON (((-77.2 36.2,...
## 6  SID74   7 MULTIPOLYGON (((-76.7 36.2,...
## 7  SID74   0 MULTIPOLYGON (((-76 36.3, -...
## 8  SID74   0 MULTIPOLYGON (((-76.6 36.3,...
## 9  SID74   4 MULTIPOLYGON (((-78.3 36.3,...
## 10 SID74   1 MULTIPOLYGON (((-80 36.3, -...
ggplot() +
  geom_sf(data = nc2, aes(fill = SID)) +
  facet_wrap(~ VAR, ncol = 1)

Selecting color palettes

Recall that Cleveland and McGill identify the color channel as one of the most difficult channels for humans to properly decode and interpret. Selection of your color palette is perhaps the most important decision to make when drawing a choropleth.

By default, ggplot2 picks evenly spaced hues around the Hue-Chroma-Luminance (HCL) color space:6

# generate simulated data points
sim_points <- data_frame(x = factor(1:6))

plots <- purrr::map(1:6, ~ ggplot(sim_points[1:.x, ], aes(x, x, color = x)) +
  geom_point(size = 5) +
    ggtitle(paste(.x, "color")) +
  theme(legend.position = "none"))

gridExtra::marrangeGrob(plots, nrow = 2, ncol = 3, top = NULL)

ggplot2 gives you many different ways of defining and customizing your scale_color_ and scale_fill_ palettes, but will not tell you if they are optimal for your specific usage in the graph.

RColorBrewer

Color Brewer is a diagnostic tool for selecting optimal color palettes for maps with discrete variables. The authors have generated different color palettes designed to make differentiating between categories easy depending on the scaling of your variable. All you need to do is define the number of categories in the variable, the nature of your data (sequential, diverging, or qualitative), and a color scheme. There are also options to select palettes that are colorblind safe, print friendly, and photocopy safe. Depending on the combination of options, you may not find any color palette that matches your criteria. In such a case, consider reducing the number of data classes.

Sequential

fb_county %>%
  mutate(rate_cut = cut_number(rate, 6)) %>%
  ggplot(aes(map_id = GEO.id2)) +
  geom_map(aes(fill = rate_cut), map = counties2) +
  expand_limits(x = counties2$long, y = counties2$lat) +
  labs(title = "Rate of foreign-born individuals in the population",
       fill = NULL) +
  ggthemes::theme_map() +
  coord_map(projection = "albers", lat0 = 25, lat1 = 50) +
  scale_fill_brewer(palette = "BuGn")

fb_county %>%
  mutate(rate_cut = cut_number(rate, 6)) %>%
  ggplot(aes(map_id = GEO.id2)) +
  geom_map(aes(fill = rate_cut), map = counties2) +
  expand_limits(x = counties2$long, y = counties2$lat) +
  labs(title = "Rate of foreign-born individuals in the population",
       fill = NULL) +
  ggthemes::theme_map() +
  coord_map(projection = "albers", lat0 = 25, lat1 = 50) +
  scale_fill_brewer(palette = "YlGn")

fb_county %>%
  mutate(rate_cut = cut_number(rate, 6)) %>%
  ggplot(aes(map_id = GEO.id2)) +
  geom_map(aes(fill = rate_cut), map = counties2) +
  expand_limits(x = counties2$long, y = counties2$lat) +
  labs(title = "Rate of foreign-born individuals in the population",
       fill = NULL) +
  ggthemes::theme_map() +
  coord_map(projection = "albers", lat0 = 25, lat1 = 50) +
  scale_fill_brewer(palette = "Blues")

Qualitative

state_data <- data_frame(name = state.name,
                         region = state.region,
                         subregion = state.division,
                         abb = state.abb) %>%
  bind_cols(as_tibble(state.x77)) %>%
  # get id variable into data frame
  left_join(usa2 %>%
              select(id, NAME) %>%
              distinct,
            by = c("name" = "NAME")) %>%
  # remove Alaska and Hawaii
  na.omit

# set region base plot
region_p <- ggplot(state_data, aes(map_id = id)) +
  geom_map(aes(fill = region), map = usa2) +
  expand_limits(x = usa2$long, y = usa2$lat) +
  labs(fill = NULL) +
  ggthemes::theme_map() +
  coord_map(projection = "albers", lat0 = 25, lat1 = 50)
region_p

# try different color brewers
region_p +
  scale_fill_brewer(palette = "Paired")

region_p +
  scale_fill_brewer(palette = "Dark2")

region_p +
  scale_fill_brewer(palette = "Pastel2")

# set subregion base plot
subregion_p <- ggplot(state_data, aes(map_id = id)) +
  geom_map(aes(fill = subregion), map = usa2) +
  expand_limits(x = usa2$long, y = usa2$lat) +
  labs(fill = NULL) +
  ggthemes::theme_map() +
  coord_map(projection = "albers", lat0 = 25, lat1 = 50)
subregion_p

subregion_p +
  scale_fill_brewer(palette = "Paired")

subregion_p +
  scale_fill_brewer(palette = "Set1")

subregion_p +
  scale_fill_brewer(palette = "Pastel1")

Session Info

devtools::session_info()
##  setting  value                       
##  version  R version 3.4.3 (2017-11-30)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2018-05-22                  
## 
##  package      * version    date       source                             
##  assertthat     0.2.0      2017-04-11 CRAN (R 3.4.0)                     
##  backports      1.1.2      2017-12-13 CRAN (R 3.4.3)                     
##  base         * 3.4.3      2017-12-07 local                              
##  bindr          0.1.1      2018-03-13 CRAN (R 3.4.3)                     
##  bindrcpp     * 0.2.2      2018-03-29 CRAN (R 3.4.4)                     
##  broom        * 0.4.4      2018-03-29 CRAN (R 3.4.3)                     
##  cellranger     1.1.0      2016-07-27 CRAN (R 3.4.0)                     
##  class          7.3-14     2015-08-30 CRAN (R 3.4.3)                     
##  classInt       0.1-24     2017-04-16 CRAN (R 3.4.0)                     
##  cli            1.0.0      2017-11-05 CRAN (R 3.4.2)                     
##  codetools      0.2-15     2016-10-05 CRAN (R 3.4.3)                     
##  colorspace     1.3-2      2016-12-14 CRAN (R 3.4.0)                     
##  compiler       3.4.3      2017-12-07 local                              
##  crayon         1.3.4      2017-10-03 Github (gaborcsardi/crayon@b5221ab)
##  crosstalk      1.0.0      2016-12-21 CRAN (R 3.4.0)                     
##  data.table     1.10.4-3   2017-10-27 CRAN (R 3.4.2)                     
##  datasets     * 3.4.3      2017-12-07 local                              
##  DBI            0.8        2018-03-02 CRAN (R 3.4.3)                     
##  devtools       1.13.5     2018-02-18 CRAN (R 3.4.3)                     
##  digest         0.6.15     2018-01-28 CRAN (R 3.4.3)                     
##  dplyr        * 0.7.4      2017-09-28 CRAN (R 3.4.2)                     
##  e1071          1.6-8      2017-02-02 CRAN (R 3.4.0)                     
##  evaluate       0.10.1     2017-06-24 CRAN (R 3.4.1)                     
##  fiftystater  * 1.0.1      2016-11-14 CRAN (R 3.4.0)                     
##  forcats      * 0.3.0      2018-02-19 CRAN (R 3.4.3)                     
##  foreign        0.8-69     2017-06-22 CRAN (R 3.4.3)                     
##  gapminder    * 0.3.0      2017-10-31 CRAN (R 3.4.2)                     
##  geosphere      1.5-7      2017-11-05 CRAN (R 3.4.2)                     
##  ggmap        * 2.6.1      2016-01-23 CRAN (R 3.4.0)                     
##  ggplot2      * 2.2.1.9000 2018-05-18 Github (tidyverse/ggplot2@54de616) 
##  glue           1.2.0      2017-10-29 CRAN (R 3.4.2)                     
##  graphics     * 3.4.3      2017-12-07 local                              
##  grDevices    * 3.4.3      2017-12-07 local                              
##  grid           3.4.3      2017-12-07 local                              
##  gtable         0.2.0      2016-02-26 CRAN (R 3.4.0)                     
##  haven          1.1.1      2018-01-18 CRAN (R 3.4.3)                     
##  hms            0.4.2      2018-03-10 CRAN (R 3.4.3)                     
##  htmltools      0.3.6      2017-04-28 CRAN (R 3.4.0)                     
##  htmlwidgets    1.2        2018-04-19 cran (@1.2)                        
##  httpuv         1.3.6.2    2018-03-02 CRAN (R 3.4.3)                     
##  httr           1.3.1      2017-08-20 CRAN (R 3.4.1)                     
##  jpeg           0.1-8      2014-01-23 CRAN (R 3.4.0)                     
##  jsonlite       1.5        2017-06-01 CRAN (R 3.4.0)                     
##  knitr        * 1.20       2018-02-20 CRAN (R 3.4.3)                     
##  labeling       0.3        2014-08-23 CRAN (R 3.4.0)                     
##  lattice        0.20-35    2017-03-25 CRAN (R 3.4.3)                     
##  lazyeval       0.2.1      2017-10-29 CRAN (R 3.4.2)                     
##  leaflet      * 1.1.0      2017-02-21 CRAN (R 3.4.0)                     
##  lubridate      1.7.4      2018-04-11 CRAN (R 3.4.3)                     
##  magrittr       1.5        2014-11-22 CRAN (R 3.4.0)                     
##  mapproj        1.2.6      2018-03-29 CRAN (R 3.4.4)                     
##  maps         * 3.3.0      2018-04-03 CRAN (R 3.4.4)                     
##  maptools     * 0.9-2      2017-03-25 CRAN (R 3.4.0)                     
##  memoise        1.1.0      2017-04-21 CRAN (R 3.4.0)                     
##  methods      * 3.4.3      2017-12-07 local                              
##  mime           0.5        2016-07-07 CRAN (R 3.4.0)                     
##  mnormt         1.5-5      2016-10-15 CRAN (R 3.4.0)                     
##  modelr       * 0.1.1      2017-08-10 local                              
##  munsell        0.4.3      2016-02-13 CRAN (R 3.4.0)                     
##  nlme           3.1-137    2018-04-07 CRAN (R 3.4.4)                     
##  nycflights13 * 0.2.2      2017-01-27 CRAN (R 3.4.0)                     
##  parallel       3.4.3      2017-12-07 local                              
##  pillar         1.2.1      2018-02-27 CRAN (R 3.4.3)                     
##  pkgconfig      2.0.1      2017-03-21 CRAN (R 3.4.0)                     
##  plotly       * 4.7.1.9000 2018-05-17 Github (ropensci/plotly@efe5535)   
##  plyr           1.8.4      2016-06-08 CRAN (R 3.4.0)                     
##  png            0.1-7      2013-12-03 CRAN (R 3.4.0)                     
##  proto          1.0.0      2016-10-29 CRAN (R 3.4.0)                     
##  psych          1.8.3.3    2018-03-30 CRAN (R 3.4.4)                     
##  purrr        * 0.2.4      2017-10-18 CRAN (R 3.4.2)                     
##  R6             2.2.2      2017-06-17 CRAN (R 3.4.0)                     
##  Rcpp           0.12.16    2018-03-13 CRAN (R 3.4.4)                     
##  readr        * 1.1.1      2017-05-16 CRAN (R 3.4.0)                     
##  readxl         1.0.0      2017-04-18 CRAN (R 3.4.0)                     
##  reshape2       1.4.3      2017-12-11 CRAN (R 3.4.3)                     
##  rgdal        * 1.2-18     2018-03-17 CRAN (R 3.4.4)                     
##  rgeos        * 0.3-26     2017-10-31 CRAN (R 3.4.2)                     
##  RgoogleMaps    1.4.1      2016-09-18 CRAN (R 3.4.0)                     
##  rjson          0.2.15     2014-11-03 CRAN (R 3.4.0)                     
##  rlang          0.2.0.9001 2018-05-18 Github (r-lib/rlang@854174a)       
##  rmarkdown      1.9        2018-03-01 CRAN (R 3.4.3)                     
##  rprojroot      1.3-2      2018-01-03 CRAN (R 3.4.3)                     
##  rstudioapi     0.7        2017-09-07 CRAN (R 3.4.1)                     
##  rvest          0.3.2      2016-06-17 CRAN (R 3.4.0)                     
##  scales         0.5.0.9000 2018-05-18 Github (hadley/scales@d767915)     
##  sf           * 0.6-1      2018-03-22 CRAN (R 3.4.4)                     
##  shiny          1.0.5      2017-08-23 cran (@1.0.5)                      
##  sp           * 1.2-7      2018-01-19 CRAN (R 3.4.3)                     
##  stats        * 3.4.3      2017-12-07 local                              
##  stringi        1.1.7      2018-03-12 CRAN (R 3.4.3)                     
##  stringr      * 1.3.0      2018-02-19 CRAN (R 3.4.3)                     
##  tibble       * 1.4.2      2018-01-22 CRAN (R 3.4.3)                     
##  tidyr        * 0.8.0      2018-01-29 CRAN (R 3.4.3)                     
##  tidyverse    * 1.2.1      2017-11-14 CRAN (R 3.4.2)                     
##  tools          3.4.3      2017-12-07 local                              
##  udunits2       0.13       2016-11-17 CRAN (R 3.4.0)                     
##  units          0.5-1      2018-01-08 CRAN (R 3.4.3)                     
##  utils        * 3.4.3      2017-12-07 local                              
##  viridisLite    0.3.0      2018-02-01 CRAN (R 3.4.3)                     
##  withr          2.1.2      2018-05-18 Github (jimhester/withr@79d7b0d)   
##  xml2           1.2.0      2018-01-24 CRAN (R 3.4.3)                     
##  xtable         1.8-2      2016-02-05 CRAN (R 3.4.0)                     
##  yaml           2.1.18     2018-03-08 CRAN (R 3.4.4)

  1. See chapter 10 in The Truthful Art for a more detailed introduction to these features.

  2. Source: Wikipedia.

  3. Originally obtained from the Census Bureau.

  4. Or as we’ll see shortly, apply a map projection to the graph.

  5. This function replaces coord_fixed().

  6. Check out chapter 6.6.2 in ggplot2: Elegant Graphics for Data Analysis for a much more thorough explanation of the theory behind this selection process